DOCUMENT RESUME

ED ICS 422 CS 001 750 



AUTHOR          Athey, Irene; O'Reilly, Robert P.
TITLE           A Criterion-Referenced Testing Model for Assessing
                Growth in Reading.
PUB DATE        Mar 75
NOTE            10p.; Paper presented at the Annual Meeting of the
                National Council for Measurement in Education
                (Washington, D.C., March 1975)

EDRS PRICE      MF-$0.76 HC-$1.58 PLUS POSTAGE
DESCRIPTORS     *Criterion Referenced Tests; *Evaluation Methods;
                Higher Education; Reading Achievement; *Reading
                Development; Reading Improvement; *Reading Research;
                Reading Skills; *Test Construction



ABSTRACT 

This study was designed to investigate the construction of a new
measure for assessing reading performance which would prove to be a
more sensitive and useful instrument for measuring the achievement
growth induced by reading programs than the usual standardized tests.
This report outlines the procedures followed in generating the model
and constructing the tests, and presents some preliminary data
collected in the course of field testing the new instrument. The
hypothesis that the criterion-referenced tests would prove to be more
sensitive measures of reading improvement over a lengthy period of
time when compared to norm-referenced tests was partially confirmed.
The results of the study are presented in narrative and table form. (BB)






A Criterion-Referenced Testing Model
for Assessing Growth in Reading

Irene Athey                         Robert P. O'Reilly
University of Rochester             State Education Department
Rochester, New York                 Albany, New York

The current pressures for increased productivity and efficiency in education
have led, in recent years, to the search for new evaluation procedures which are
capable of measuring the outcomes of school instruction with more precision than
traditional standardized tests. Specifically, the most urgent need is for
measures which are relevant to a school system's program objectives, and which
can provide information useful in making decisions about the merits of its
programs.

This study was designed to investigate the possibility of constructing a
new measure of reading performance which, when administered in a longitudinal
matrix-sampling design, might prove to be a more sensitive and useful
instrument for measuring the achievement growth induced by reading programs
than the usual standardized tests. This report will outline the procedures
followed in generating the model and constructing the tests, and will present
some preliminary analyses of the data collected in the course of field-testing
the new instrument.

The initial source of input to the test development process was provided
by the Bank of Reading Objectives (BRO) compiled and organized over the period
of a year by a team of reading researchers and curriculum experts under the
auspices of the Bureau of School and Cultural Research of the New York State
Education Department. During the course of this enterprise, many reading






Paper presented at the annual conference of the National Council for
Measurement in Education, Washington, D. C., March 31, 1975.



systems were examined, but the final product of some 2000 objectives borrowed
heavily from the SOBAR system developed by Skager at UCLA and Cohen's reading
system. After considerable rewriting and modification, the Bank was organized
into six areas: Multisensory Readiness, Decoding, Vocabulary, Comprehension,
Location and Study Skills, and Reading in the Content Areas. In the initial
phase of the study a decision was made to confine the construction of test items
to two areas, Vocabulary and Comprehension. Committees of teachers and reading
specialists in each of the nine participating school districts selected from the
total bank those objectives in the areas of vocabulary and comprehension which
seemed most pertinent to their own reading programs. In addition, they indicated
the relative emphasis which the district placed on the skills represented by those
objectives, assigning to each objective the number of test items they wished to
have constructed. At the outset it was agreed that the total test should not
exceed 30 minutes, so the committees were under some constraints to ensure that
all important objectives were indeed adequately represented. Lists of basic
sight words and of graded reading materials supplied by the school districts
were used by the test construction team as a guide in the selection of passages
and vocabulary items.

For the pilot phase, grades 4 through 6 were selected for intensive study.
However, criterion-referenced tests must be geared to the level at which students
are actually achieving, irrespective of their current grade placement. Since
the range of achievement was approximately seven years for these three grades,
seven levels of the tests, corresponding roughly to grades 1 through 7, were
constructed.

The longitudinal design of the study called for repeated measurement on
the same subjects at regular intervals determined by the school district. For
example, a district might elect to administer the tests five times a year at






intervals of two months. If the district or the teachers were reluctant to
retest so often, or at such short intervals, they had the option of confining
the testing to three administrations at the beginning, middle, and end of the
school year. However, in the pilot phase, all school districts administered
the tests five times at two-weekly intervals between March and June, 1974. To
enable them to do so, five equivalent forms of the tests were constructed at
each level, and students were assigned in random order to the various forms
in such a way that ultimately every student took all five forms of the test
at the level to which he had been assigned by his teacher. Approximately
4000 students constituted the sample, and they came from school districts with
a wide range of demographic characteristics, including urban centers in the
largest cities of New York State, and some suburban and rural areas.
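
One way to obtain the property described above, in which every student
eventually takes all five forms and the forms are spread evenly over the
five administrations, is a cyclic (Latin-square) rotation applied to a
shuffled class list. The sketch below is only an illustration under that
assumption; the paper does not say which assignment procedure was used, and
the form labels, function name, and seed are hypothetical.

```python
from random import Random

# Hypothetical labels for the five equivalent forms at one test level.
FORMS = ["A", "B", "C", "D", "E"]


def assign_form_sequences(student_ids, forms=FORMS, seed=0):
    """Return {student: [form at administration 1, 2, ...]}.

    Students are shuffled (the random order), then dealt a cyclic rotation
    of the forms, so each student meets every form exactly once and each
    form is used about equally often at every administration.
    """
    rng = Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    k = len(forms)
    schedule = {}
    for position, student in enumerate(shuffled):
        start = position % k  # which form this student takes first
        schedule[student] = [forms[(start + t) % k] for t in range(k)]
    return schedule


if __name__ == "__main__":
    plan = assign_form_sequences(range(1, 11))
    for student, sequence in sorted(plan.items()):
        print(student, " ".join(sequence))
```

With a class size that is a multiple of five, each form is taken by exactly
one fifth of the students at each administration.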

Concurrent with the testing program, a technological support system was
developed at a central location and implemented at various locations adjacent
to the participating school districts. The tests were printed and assembled
at the central location and shipped directly to the schools. Following each
test administration, they were scored at the nearest computer facility, and
copies of the data analyses returned to the central location and to the
school district. Feedback to the schools consisted of individual and group
scores on every item, every objective, and the total test. Using this
information, the teacher could ascertain the effectiveness of instruction on any
specific skill from one test administration to the next, or even the extent
of students' mastery of a skill prior to instruction, for example, at the
beginning of the school year. The data could thus serve not only to diagnose
the strengths and weaknesses of particular students, but to suggest more
effective uses of the available instructional time both for individual and
group purposes. Similar feedback was also provided to each student in the






form of a coupon showing his progress on each item and skill and on the total
test for each administration. Teachers have reported informally that provision
of this type of feedback in which students monitor their own progress can be
highly motivating, and this assertion may be tested empirically in a later phase
of the study.
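
The three levels of feedback described above (every item, every objective,
and the total test, for individuals and for the group) can be sketched as a
small scoring routine. This is a minimal illustration, not the project's
scoring program: the item names, objective names, and responses are
invented, and items are assumed to be scored 1 (correct) or 0 (incorrect).

```python
from collections import defaultdict

# Hypothetical item-to-objective map and scored responses (1 = correct).
ITEM_OBJECTIVE = {
    "i1": "main idea",
    "i2": "main idea",
    "i3": "cause and effect",
    "i4": "classification",
}
RESPONSES = {
    "student_01": {"i1": 1, "i2": 0, "i3": 1, "i4": 1},
    "student_02": {"i1": 1, "i2": 1, "i3": 0, "i4": 0},
}


def student_report(answers, item_objective=ITEM_OBJECTIVE):
    """Item scores, proportion correct per objective, and total score."""
    by_objective = defaultdict(list)
    for item, score in answers.items():
        by_objective[item_objective[item]].append(score)
    return {
        "items": dict(answers),
        "objectives": {o: sum(s) / len(s) for o, s in by_objective.items()},
        "total": sum(answers.values()),
    }


def group_report(responses, item_objective=ITEM_OBJECTIVE):
    """Proportion correct per item and per objective, and mean total score."""
    n = len(responses)
    item_p = {
        item: sum(r[item] for r in responses.values()) / n
        for item in item_objective
    }
    by_objective = defaultdict(list)
    for item, p in item_p.items():
        by_objective[item_objective[item]].append(p)
    return {
        "items": item_p,
        "objectives": {o: sum(p) / len(p) for o, p in by_objective.items()},
        "mean_total": sum(sum(r.values()) for r in responses.values()) / n,
    }


if __name__ == "__main__":
    print(student_report(RESPONSES["student_01"]))
    print(group_report(RESPONSES))
```

Running the same report after each administration and comparing the
objective-level proportions would give the administration-to-administration
view of a specific skill that the text describes.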

An important objective of the pilot study was to obtain feedback from the
schools on both the test instrument and the computer support system. A
questionnaire was distributed to all cooperating teachers, in which they were
encouraged to critique the format and composition of the tests, and to comment on
the usefulness of the feedback in the form in which it was presented. Several
major modifications and refinements were introduced as a result of this input
from the schools. First, it became apparent that comprehension was the
overriding concept, and that most of the skills which had been designated as
vocabulary objectives could readily be subsumed under comprehension. At this time
an attempt was also made to reconceptualize the notion of reading comprehension
to determine whether all the major parameters had been included in the
operational definition constituted by the objectives which had been consensually
validated by the reading teachers. Every school district, for example,
recognized the importance of certain cognitive skills such as the ability to classify
or to designate cause and effect. On the other hand, few included the ability
to process complex syntactic structures, although such an ability must clearly
be related to the comprehension of printed text.* Second, it was found that
even though the tests had been based on graded instructional materials and trade
books, some of the passages and items were too difficult for the levels at which
they had been written. All passages used in the second round of the study were
therefore submitted to a Dale-Chall readability formula, with additional checks

*For further discussion of this topic see Athey (1975).






for vocabulary level in the Harris-Jacobson Basic Word Lists. In addition,
the questions were carefully screened to ensure that the passages must be
read in order to obtain the necessary information for answering the questions,
and to eliminate "keying" of one item by another. An attempt was also made
to sample the universe of "real life" reading materials in a way that would
adequately represent the various domains of interest for students of a given
age range. Third, the original seven levels were expanded to 20 levels, two
per grade for grades 1 through 10. This extension permits a school district
either to test at more frequent intervals or to enjoy greater leeway in the
selection of items in the compilation of its own tests. To facilitate this
selection, the test items are currently presented in the form of a Test
Development Notebook consisting of 800 pages containing some 4000 items. The
items are grouped in about 20 content categories representing skills of
literal and interpretive comprehension. Thus, a typical page of the notebook
might contain a passage of the length prescribed for a particular level and a
maximum of five items representing two or three objectives. These pages can
be removed from the notebook to be reproduced and assembled by the school.
This arrangement gives the school district considerable flexibility in
assembling its own test packages without affecting the overall research design of
the study.
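
The Dale-Chall readability check applied to the second-round passages can
be illustrated with the published 1948 raw-score formula, which combines
the percentage of words not on the Dale list of familiar words with the
average sentence length. The sketch below is an approximation only: the
word set is a tiny stand-in for the real list of about 3000 words, the
sentence and word splitting is crude, and the formula's special rules for
names and inflected forms are ignored. Nothing here is taken from the
project's own procedure.

```python
import re

# Tiny stand-in for the Dale list of familiar words (assumption: the real
# list would be loaded from a file; it is not reproduced here).
FAMILIAR_WORDS = {
    "the", "dog", "ran", "to", "a", "big", "house",
    "he", "sat", "down", "was", "very",
}


def dale_chall_raw_score(text, familiar_words=FAMILIAR_WORDS):
    """Dale-Chall (1948) raw score for a passage.

    raw = 0.1579 * (percent unfamiliar words)
        + 0.0496 * (average words per sentence)
        + 3.6365
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    unfamiliar = sum(w not in familiar_words for w in words)
    pct_unfamiliar = 100.0 * unfamiliar / len(words)
    avg_sentence_length = len(words) / len(sentences)
    return 0.1579 * pct_unfamiliar + 0.0496 * avg_sentence_length + 3.6365


if __name__ == "__main__":
    print(dale_chall_raw_score("The dog ran to a big house. He sat down."))
    print(dale_chall_raw_score("He was very fatigued."))
```

A passage whose raw score falls above the band associated with its intended
level would be a candidate for rewriting or for reassignment to a higher
level.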

The major hypothesis of the longitudinal study was that the criterion-
referenced tests would prove to be more sensitive measures of growth in
reading achievement over a long-term period than norm-referenced standardized tests.
A context for examining the issue of test sensitivity was created in another
phase of the study by gathering data on teacher and student characteristics
and on the quantity and quality of the instructional process. Student
characteristics data included individual and group socioeconomic status and



achievement and/or ability scores available from previous years of schooling.
Instructional process data were gathered through individual interviews with
teachers, specialists, and aides involved in the classroom and in special
reading programs. The quantity and quality of instruction available to each
student in each program condition were quantified. The data set included
estimates of the number of instructional minutes available to individual
students under different resource conditions, personnel configurations, materials,
equipment, facilities, teacher experience and training, and classroom
organization (individual, small group, or large group). The notion guiding the
collection of student and teacher characteristics data was that factors under the
control of the school would provide appropriate criteria for differentiating
between achievement tests on the issue of sensitivity. Specifically, it was
hypothesized that school factors should contribute relatively more strongly to
changes in scores on the criterion-referenced tests.

In examining the first results of the analyses designed to test this
hypothesis, it should be borne in mind that the data are those gathered in the
pilot phase, before the criterion-referenced tests underwent considerable
modification. The results should therefore be considered as highly tentative,
and interpretations or implications should be drawn in the light of that fact.

Two analyses which were available at the time of writing, using the CRT
and the California Achievement Test (Reading, Total Score) as criteria in the
regression equations at grade levels 4 and 5, are summarized in Tables 1 and 2.
At both levels, the CRT proved to have substantial predictive value, with the
percentage of white students in the class, the Grade 3 PEP scores, the amount
of individual help given by the teacher, and the number of pupils in the class
being other significant predictors. The multiple correlations of the 22






Table 1

Comparison of Predictive Variables Using Criterion-Referenced
Test and California Achievement Test as Criteria
Level 4 (N=607)

Independent Variables                          r*     wt.

Criterion-Referenced Test
  CRT pretest                                 .57    .41
  Percentage of white students in class       .32    .29
  Grade 3 PEP scores                          .31    .22
  R c.1...22 = [illegible]

California Achievement Test
  CRT pretest                                 .49    .33
  Percentage of white students in class       .47    .31
  Grade 3 PEP scores                          .37    .29
  R c.1...22 = [illegible]

*A correlation of .09 is significant at the .05 level for N=500.






Table 2

Comparison of Predictive Variables Using Criterion-Referenced
Test and California Achievement Test as Criteria
Level 5 (N=497)

Independent Variables                          r*     wt.

Criterion-Referenced Test
  CRT pretest                                 .69    .49
  Minutes per year of individual help
    by teacher                                .28    .13
  R c.1...22 = .76

California Achievement Test
  CRT pretest                                 .37    .39
  Number of pupils in class                   .22    .26
  Grade 3 PEP scores                          .37    .23
  R c.1...22 = .63

*A correlation of .09 is significant at the .05 level for N=500.
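
The table note's threshold of .09 can be checked with the usual conversion
from a critical t value to a critical correlation, r = t / sqrt(t^2 + df)
with df = N - 2. The sketch below assumes a two-tailed test and substitutes
the normal quantile for t, which is very close at this sample size; the
function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def critical_r(n, alpha=0.05):
    """Smallest |r| significant at a two-tailed alpha for sample size n.

    Assumption: the normal quantile stands in for the t quantile, which is
    a close approximation when n is in the hundreds.
    """
    df = n - 2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / sqrt(z * z + df)


print(round(critical_r(500), 2))  # prints 0.09
```

On this basis every simple correlation listed in Tables 1 and 2 is well
above the threshold for samples of about 500 to 600 students.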






independent variables to the two criteria are also shown. At the fourth
grade level, there was no difference in the sensitivity of the CRT and the CAT
to the total process and characteristics variables, but at the fifth grade level,
the correlation for the CRT criterion was substantially higher (r=.76) than for
the CAT (r=.63). This suggests that the initial hypothesis has some plausibility,
and we would expect, with further refinement of the tests, that the results
would be even stronger in this direction. We would also hope that variables
outside the instructional process (such as socioeconomic status) would become
less evident, so that the criterion tests become purer measures of the outcome
of the instructional process. The present study has pointed the direction
and provided results which are sufficiently encouraging to suggest that our
original hypotheses were well-founded.
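
To make the quantities in the tables concrete, the sketch below computes
simple correlations (r), standardized regression weights, and the multiple
correlation for the two-predictor case, using the standard formulas based
on the three pairwise correlations. It is a simplified illustration: the
study used 22 independent variables, the "wt." column is assumed here to
hold standardized weights, and the data are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def two_predictor_regression(y, x1, x2):
    """Simple r's, standardized weights, and multiple R for two predictors."""
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    # R squared equals the sum of each weight times its simple correlation.
    multiple_r = sqrt(beta1 * r_y1 + beta2 * r_y2)
    return {"r": (r_y1, r_y2), "weights": (beta1, beta2), "multiple_R": multiple_r}


if __name__ == "__main__":
    # Invented data: pretest score, minutes of individual help, posttest score.
    pretest = [10, 12, 15, 18, 20, 25]
    help_minutes = [30, 10, 40, 20, 50, 35]
    posttest = [14, 13, 20, 19, 27, 28]
    print(two_predictor_regression(posttest, pretest, help_minutes))
```

Comparing the multiple R obtained with the CRT as criterion against the one
obtained with the CAT as criterion, for the same set of predictors, is the
comparison the sensitivity hypothesis rests on.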

References

Athey, I. Children's understanding of syntax in relation to reading
comprehension. Paper presented at the annual conference of the
International Reading Association, New York, May 1975.



